Non-linear dimensionality reduction: Principal curves, local MDS, Isomap and t-SNE

PART A. Principal curves

1. Choosing the smoothing parameter in Principal Curves (Hastie and Stuetzle 1989)

Consider the 3-dimensional data set generated by the following code.

library(plot3D)
library(princurve)
t <- seq(-1.5*pi,1.5*pi,l=100)
R<- 1
n<-75
sd.eps <- .15

set.seed(1)
y <- R*sign(t) - R*sign(t)*cos(t/R)
x <- -R*sin(t/R)
z <- (y/(2*R))^2
rt <- sort(runif(n)*3*pi - 1.5*pi)
eps <- rnorm(n)*sd.eps
ry <- R*sign(rt) - (R+eps)*sign(rt)*cos(rt/R)
rx <- -(R+eps)*sin(rt/R)
rz <- (ry/(2*R))^2 + runif(n,min=-2*sd.eps,max=2*sd.eps)
XYZ <- cbind(rx,ry,rz)


require(plot3D)
lines3D(x,y,z,colvar = NULL, 
         phi = 20, theta = 60, r =sqrt(3), d =3, scale=FALSE,
         col=2,lwd=4,as=1,
         xlim=range(rx),ylim=range(ry),zlim=range(rz))
points3D(rx,ry,rz,col=4,pch=19,cex=.6,add=TRUE)

When fitting principal curves to these data, use the function princurve::principal_curve with the following options:

  • smoother="smooth_spline". This is the default, so you do not need to set it explicitly.
  • The only additional argument that you will pass to the smoother will be the degrees of freedom df (see `help(smooth.spline)` if you want).

For instance, the following call

principal_curve(XYZ, df=6)

fits the required principal curve with degrees of freedom df equal to 6.
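As a quick orientation, this sketch lists the components of the fitted object that the code below relies on (see help(principal_curve); it assumes the XYZ matrix generated above):

```r
library(princurve)
fit <- principal_curve(XYZ, df = 6)
# fit$s      : projection of each data point onto the curve (rows match XYZ)
# fit$ord    : permutation that sorts the rows of fit$s along the curve
# fit$lambda : arc-length position of each projection along the curve
# fit$dist   : total squared distance of the points to the curve
str(fit, max.level = 1)
```

In particular, fit$s[fit$ord, ] gives the curve as an ordered polyline, ready for plotting.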

Questions

  1. Choose the value of the degrees of freedom df by leave-one-out cross-validation. Restrict the search of df to seq(2,8,by=1). (Hint: The function project_to_curve should be used. See the element dist of the object it returns).
# Leave-one-out CV: fit the curve on n-1 points and accumulate the squared
# distance of the held-out point to its projection on the fitted curve.
loocv <- function(data, k){
  distance <- 0
  for (i in 1:nrow(data)){
    pred <- princurve::principal_curve(data[-i, ], df = k)
    # project_to_curve() expects a matrix of points; its 'dist' element is the
    # total squared distance of the projected points to the curve
    pred_proj <- princurve::project_to_curve(data[i, , drop = FALSE],
                                             pred$s[pred$ord, ])
    distance <- distance + pred_proj$dist
  }
  return(distance)
}

K=seq(2,8,by=1)
distances=c()

for(k in K){
  distances=append(distances,loocv(XYZ,k))
}

min_k=K[which.min(distances)]
plot(K, distances)
abline(v=min_k,col=2)

  2. Give a graphical representation of the principal curve output for the optimal df and comment on the obtained results.

The principal curve with the optimal df=6 follows the data in an S-like shape: it captures the overall trend without chasing individual points.

best_df <- min_k
best_pc_fit <- principal_curve(XYZ, df = best_df)

# Project the data onto the selected principal curve
projected_data <- project_to_curve(XYZ, best_pc_fit$s[best_pc_fit$ord, ])

# Total squared distance of the points to the fitted curve (dist element)
projection_distances <- best_pc_fit$dist

cat("Best df:", best_df, "\n")
## Best df: 6
cat("Projection Distances:", projection_distances, "\n")
## Projection Distances: 5.729252
plot(best_pc_fit,xlim=range(XYZ[,1]),ylim=range(XYZ[,2]),asp=1)
points(XYZ,col=4)

# 3D PLOT OF THE CURVE
curve_coords = best_pc_fit$s[best_pc_fit$ord,]
lines3D(curve_coords[,1],curve_coords[,2],curve_coords[,3],colvar = NULL, 
         phi = 30, theta = 60, r =sqrt(3), d =3, scale=FALSE,
         col=2,lwd=4,as=1,
         xlim=range(XYZ[,1]),ylim=range(XYZ[,2]),zlim=range(XYZ[,3]))
points3D(XYZ[,1],XYZ[,2],XYZ[,3],col=4,pch=19,cex=.6,add=TRUE)

  3. Compute the leave-one-out cross-validation error for df=50 and compare it with the result corresponding to the optimal df value you found before.
  • Before fitting the principal curve with df=50, and based only on the leave-one-out cross-validation error values, which value of df do you think is better, the previous optimal one or df=50?
  • Fit now the principal curve with df=50 and plot the fitted curve in the 3D scatterplot of the original points.
  • Now, which value of df do you prefer?
  • The overfitting with df=50 is clear. Nevertheless, leave-one-out cross-validation has not been able to detect this fact. Why do you think df=50 is given such a good value of the leave-one-out cross-validation error?
distance <- loocv(XYZ, 50)
cat("LOOCV for df=50: ", distance, "\n")
## LOOCV for df=50:  2.954621
# note: distances is indexed by position in K=seq(2,8), not by the df value
# itself, so the minimum is min(distances), not distances[min_k]
cat("LOOCV for optimal df=6:", min(distances))

Which value of df do you think is better, the previous optimal one or df=50? Without plotting the principal curve, the cross-validation error for df=50 looks far better than the one for the optimal df. It is so small that it seems almost too good to be true.

pred <- princurve::principal_curve(XYZ, df=50)
# order the fitted points along the curve before drawing it
curve50 <- pred$s[pred$ord, ]
lines3D(curve50[,1], curve50[,2], curve50[,3], colvar = NULL,
        phi = 20, theta = 60, r = sqrt(3), d = 3, scale = FALSE,
        col = 2, lwd = 4,
        xlim = range(XYZ[,1]), ylim = range(XYZ[,2]), zlim = range(XYZ[,3]))
points3D(XYZ[,1], XYZ[,2], XYZ[,3], col = 4, pch = 19, cex = .6, add = TRUE)

Which value of df do you prefer? The plot of the principal curve with df=50 makes it clear that the curve totally overfits the generated data and does not generalize. Because of that, we prefer the optimal df=6.

Why do you think df=50 is given such a good leave-one-out cross-validation error? Increasing the degrees of freedom makes the principal curve more flexible, so the model is able to capture noise in the data. A more flexible model reduces bias and can attain a low LOOCV error while generalizing poorly to new data. In this case LOOCV cannot detect the overfitting because the curve with 50 degrees of freedom is very long and practically interpolates the data points, so it becomes dense in the region of the data. This makes the distance from a held-out point to its projection on the curve small, pushing the LOOCV error towards zero.
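The "long, dense curve" argument can be checked numerically by comparing the total lengths of the two fitted curves. The helper below is not part of princurve; it simply sums the Euclidean lengths of the segments of an ordered polyline:

```r
# Total length of a polyline given as a matrix of ordered points (one per row)
curve_length <- function(S) {
  sum(sqrt(rowSums(diff(S)^2)))
}

# Comparison sketch (assumes XYZ from above):
# pc6  <- princurve::principal_curve(XYZ, df = 6)
# pc50 <- princurve::principal_curve(XYZ, df = 50)
# curve_length(pc6$s[pc6$ord, ])    # moderate: follows the overall trend
# curve_length(pc50$s[pc50$ord, ])  # much larger: the curve chases the noise
```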

Part B. Local MDS, ISOMAP and t-SNE

Consider the ZIP number data set, from the book of Hastie et al. (2009). Read the training data set (in the file zip.train) and select only the ZEROs.

There are n=1194 digits corresponding to ZEROs in the data set.

The following function plots a digit:

zip.train <- read.table("zip.train")
data=zip.train[zip.train$V1==0,][,-1]
print(dim(data)[1])
## [1] 1194
# plotting one digit
plot.zip <- function(x, use.first=FALSE, ...){
  x <- as.numeric(x)
  if (use.first){
    x.mat <- matrix(x, 16, 16)
  } else {
    x.mat <- matrix(x[-1], 16, 16)  # first entry is the digit label
  }
  image(1:16, 1:16, x.mat[, 16:1],
        col = gray(seq(1, 0, l = 12)), ...)
  if (!use.first) title(x[1])
  invisible(NULL)
}

Here you have several examples of these ZERO digits:
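Such a panel can be produced with plot.zip itself (the choice of rows is arbitrary; this sketch assumes the data object read above):

```r
# Show the first 9 ZEROs of the training set in a 3x3 grid
op <- par(mfrow = c(3, 3))
for (i in 1:9) {
  plot.zip(data[i, ], use.first = TRUE)
}
par(op)
```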

2. Local MDS for ZERO digits

You must apply Local MDS to reduce the dimensionality of this dataset using the function lmds from package stops. You have to install the stops library from this link and then attach it:

#if (!require(stops, quietly=TRUE, warn.conflicts=FALSE)){
#  install.packages("stops", repos="http://R-Forge.R-project.org",INSTALL_opts="--no-test-load")
#}

library(stops)
## Loading required package: smacof
## Loading required package: plotrix
## Loading required package: colorspace
## Loading required package: e1071
## 
## Attaching package: 'smacof'
## The following object is masked from 'package:base':
## 
##     transform
## Loading required package: rgl
## 
## Attaching package: 'rgl'
## The following object is masked from 'package:plotrix':
## 
##     mtext3d
## 
## Attaching package: 'stops'
## The following object is masked from 'package:smacof':
## 
##     torgerson
## The following object is masked from 'package:stats':
## 
##     cmdscale
# help(lmds)
  1. Look for a 2-dimensional (q=2) configuration of the data using parameters k=5 and tau=0.05 in lmds function. Do the scatterplot of the obtained 2-dimensional configuration.
x <- dist(as.matrix(data))

n <- nrow(data)  # note: dist objects have no dim(), so take n from the data
k <- 5
tau <- .05
q <- 2

# classical MDS provides the initial configuration for Local MDS
conf <- stats::cmdscale(x, k=q)
zero <- lmds(as.matrix(x), init=conf, ndim=q, k=k, tau=tau, itmax=1000)

plot(zero$conf, asp=1, main=paste0("Local MDS, k=",k,", tau=",tau))
text(zero$conf[,1], zero$conf[,2], 1:n, pos=3, cex=0.7)

  2. In the previous scatterplot, select a few points (9 points, for instance) located in such a way that they cover the variability of all the points in the scatterplot. Then use the function plot.zip to plot the ZERO digits corresponding to the selected points. The images you are plotting should allow you to give an interpretation of the 2 coordinates obtained by Local MDS (observe how the shape of ZEROs changes when moving along each direction of the scatterplot).

By doing this we have seen that the second coordinate from Local MDS represents how thick the drawn line is, while the first coordinate represents whether the circle is deformed more horizontally or more vertically.
pos_names =  c("left_top",  "top",   "right_top", "left", "middle", "right",  "left_bot", "bot",  "right_bot")
sel_points = c("638",       "7290",  "5727",      "4808", "4465",   "737"  ,  "6795",     "5697", "3572")
op <- par(mfrow = c(3,3))
for (i in 1:9){
  plot.zip(main=pos_names[i],data[sel_points[i],],use.first = TRUE)
}

Another approach would be to detect 9 clusters in the data configuration obtained by Local MDS and use the centroids as points for visualizing the numbers. With this approach, however, no pattern as clear as in the approach described above is visible.

# select 9 points that cover the variability by looking for 9 clusters over
# the data and using the points closest to their centroids
k9 <- kmeans(zero$conf, 9)
plot(zero$conf, col = k9$cluster, main = "K-Means Clustering on Local MDS")
points(k9$centers, pch = 3, cex = 1.5)

library(FNN)
# nearest observed point to each centroid
nn <- get.knnx(zero$conf, k9$centers, 1)
sel <- as.vector(nn$nn.index)

op <- par(mfrow = c(3, 3))
for (p in sel){
  plot.zip(main = paste("cluster", k9$cluster[p]), data[p, ], use.first = TRUE)
}
par(op)

  3. Use the local continuity meta criteria to select the tuning parameters k and tau in Local MDS for ZERO digits. Then describe graphically the low dimensional configuration corresponding to the optimal parameters. Indication: As tentative values for k use c(5,10,50), and for tau use c(.1,.5,1).
# Local continuity meta criterion (LCMC): average overlap between the
# Kp-nearest-neighbour sets computed from the original distances (D1) and
# from the low-dimensional distances (D2), adjusted for the overlap
# expected by chance.
LCMC <- function(D1, D2, Kp){
  D1 <- as.matrix(D1)
  D2 <- as.matrix(D2)
  n <- dim(D1)[1]
  N.Kp.i <- numeric(n)
  for (i in 1:n){
    N1.i <- sort.int(D1[i,], index.return = TRUE)$ix[1:Kp]
    N2.i <- sort.int(D2[i,], index.return = TRUE)$ix[1:Kp]
    N.Kp.i[i] <- length(intersect(N1.i, N2.i))
  }
  N.Kp <- mean(N.Kp.i)
  M.Kp.adj <- N.Kp/Kp - Kp/(n-1)

  return(list(N.Kp.i = N.Kp.i, M.Kp.adj = M.Kp.adj))
}
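A quick sanity check of the LCMC implementation (not part of the assignment): when the two distance matrices are identical, every Kp-neighbour set is recovered, so the adjusted criterion equals 1 - Kp/(n-1):

```r
# With D1 == D2 the neighbour sets coincide, so N.Kp = Kp and
# M.Kp.adj = 1 - Kp/(n-1); for n = 50, Kp = 10 this is 1 - 10/49
D.check <- dist(matrix(rnorm(50*3), 50, 3))
LCMC(D.check, D.check, Kp = 10)$M.Kp.adj
```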
K <- c(5,10,50)
tau <- c(.1,.5,1)
q <- 2
D1 <- x
Kp <- 10

LC.lmds <- matrix(0, nrow=length(K), ncol=length(tau))
lmds.res <- array(vector("list",1), dim=dim(LC.lmds))  # do not shadow stops::lmds
conf <- stats::cmdscale(x, k=q)

for (i in 1:length(K)){
  for (j in 1:length(tau)){
    lmds.res[[i,j]] <- lmds(as.matrix(x), init=conf, ndim=q,
                            k=K[i], tau=tau[j], itmax=1000)$conf
    D2 <- dist(lmds.res[[i,j]])
    LC.lmds[i,j] <- LCMC(D1, D2, Kp)$M.Kp.adj
  }
}

ij.max <- arrayInd(which.max(LC.lmds), .dim=dim(LC.lmds))
k.max <- K[ij.max[1]]
tau.max <- tau[ij.max[2]]
lmds.max <- lmds.res[[ij.max[1], ij.max[2]]]

# LCMC as a function of k (at the optimal tau) and of tau (at the optimal k)
plot(K, LC.lmds[, ij.max[2]], type="b", main=paste0("K=", round(k.max,4)))
abline(v=k.max, col=2)

plot(tau, LC.lmds[ij.max[1], ], type="b", main=paste0("tau=", round(tau.max,4)))
abline(v=tau.max, col=2)

Using the LCMC we obtain, from the given pool of parameters, the optimal configuration shown below. Compared with the scatterplot of the previous configuration, the point cloud is closer to a straight line: the representations are more spread out along the first dimension while the range of the second dimension is roughly preserved, so the two dimensions show a higher correlation.

print(paste0("k.max=",k.max,"; tau.max=",tau.max))
## [1] "k.max=5; tau.max=1"
plot(lmds.max, asp=1, main=paste0("Local MDS, k=",k.max,", tau=",tau.max))
text(lmds.max[,1], lmds.max[,2], 1:n, pos=3, cex=.7)

3. ISOMAP for ZERO digits

  1. Look for a 2-dimensional (q=2) configuration of the data using parameter k=5 in function isomap from package vegan. Do the scatterplot of the obtained 2-dimensional configuration.
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
x=dist(as.matrix(data))
q<-2
k <- 5

ismp <- isomap(x,ndim=q, k=k)
plot(ismp,n.col=3,main="Output of ISOMAP Algorithm")
text(ismp$points[,1],ismp$points[,2],rownames(data),pos=3, cex=.5)

  2. In the previous scatterplot, select a few points (9 points, for instance) located in such a way that they cover the variability of all the points in the scatterplot. Then use the function plot.zip to plot the ZERO digits corresponding to the selected points. The images you are plotting should allow you to give an interpretation of the 2 coordinates obtained by ISOMAP (observe how the shape of ZEROs changes when moving along each direction of the scatterplot).

In the following figure we can appreciate the relation between the coordinates defined by ISOMAP and the types of zeroes drawn. The zeros with lower values in dimension 2 are those with lines connecting them to the previous digit of the ZIP code, while those with high values in dimension 2 are rounder and less dense, with fewer painted pixels.

For the other dimension, the zeros with high values are rounder and larger, while those with low values are thin and tall.

It is also visible that the zeros with values around 0 in dimension 2 are drawn with thick lines.

pos_names =  c("left_top", "top",   "right_top", "left", "middle", "right",  "left_bot", "bot",  "right_bot")
sel_points = c("522",      "5026",  "6989",      "7069", "4774",   "364"  ,  "6659",     "1126", "1884")
op <- par(mfrow = c(3,3))
for (i in 1:9){
  plot.zip(main=pos_names[i],data[sel_points[i],],use.first = TRUE)
}

  3. Use the local continuity meta criteria to select the tuning parameter k in ISOMAP for ZERO digits. Then describe graphically the low dimensional configuration corresponding to the optimal parameter. Indication: As tentative values for k use c(5,10,50).
library(vegan)

k_values <- c(5, 10, 50)
q <- 2
Kp <- 10  # fixed neighbourhood size in LCMC, so values are comparable across k
D1 <- dist(data)
LC.ismp <- numeric(length(k_values))

isomap_results <- vector("list", length(k_values))

for (i in 1:length(k_values)){
  k <- k_values[i]
  isomap_results[[i]] <- isomap(D1, ndim=q, k=k)
  dist_k <- dist(isomap_results[[i]]$points[,1:q])
  LC.ismp[i] <- LCMC(D1, dist_k, Kp)$M.Kp.adj
}

i_max <- which.max(LC.ismp)
k_max <- k_values[i_max]
isomap_max <- isomap_results[[i_max]]

plot(k_values, LC.ismp, type="b", main=paste0("K=",round(k_max,4)))
abline(v=k_max,col=2)

best_k_isomap <- isomap(D1, ndim=q, k=k_max)
pairs(best_k_isomap$points, pch=20, main="Best ISOMAP output in 2-dim")

plot(best_k_isomap$points[,1], best_k_isomap$points[,2], pch=20,
     main="Best ISOMAP output in 2-dim", asp=1)
text(best_k_isomap$points[,1], best_k_isomap$points[,2],
     rownames(data), pos=3, cex=.5)

To explain this low-dimensional representation we use the same method as before. We have seen that the vertical dimension is correlated with the density of the image, i.e. how black the zero is. The first (horizontal) dimension seems correlated with the roundness of the zero, with round zeros taking the larger values in this dimension.

pos_names =  c("left_top", "top",   "right_top", "left", "middle", "right", "left_bot", "bot",  "right_bot")
sel_points = c("2695",     "5542",  "5911",      "6381", "5626",   "4964",  "6462",     "6890", "4898")
op <- par(mfrow = c(3,3))
for (i in 1:9){
  plot.zip(main=pos_names[i],data[sel_points[i],],use.first = TRUE)
}

4. t-SNE for ZERO digits

You must apply t-SNE to reduce the dimensionality of this dataset using the function Rtsne from package Rtsne.

library(Rtsne)
# help(Rtsne)
  1. Look for a 2-dimensional (q=2) configuration of the data using parameters perplexity=40 and theta=0 in Rtsne function. Do the scatterplot of the obtained 2-dimensional configuration.
x=dist(as.matrix(data))

set.seed(42)
theta= 0.0
perp = 40

tsne_out <- Rtsne(x, pca=FALSE, perplexity=perp, theta=theta)

plot(tsne_out$Y, asp=1, main="t-SNE on ZIP data")
text(tsne_out$Y[,1], tsne_out$Y[,2], rownames(data), pos = 3, cex = .7)

  2. In the previous scatterplot, select a few points (9 points, for instance) located in such a way that they cover the variability of all the points in the scatterplot. Then use the function plot.zip to plot the ZERO digits corresponding to the selected points. The images you are plotting should allow you to give an interpretation of the 2 coordinates obtained by t-SNE (observe how the shape of ZEROs changes when moving along each direction of the scatterplot).

Proceeding as before, we see that the zeros at the bottom deviate from the usual circle shape: they have additional segments, where the line that defines the zero is longer than it should be and leaves an external ramification. At the top of the plot, in contrast, are the zeros whose circle is incomplete, where the line is too short. The other dimension might be related to the vertical or horizontal deformation of the zero, but this is not very clear.

pos_names =  c("left_top", "top",   "right_top", "left", "middle", "right", "left_bot", "bot",  "right_bot")
sel_points = c("5331",     "2001",  "4391",      "5819", "5333",   "4272",  "5532",     "4614", "6890")
op <- par(mfrow = c(3,3))
for (i in 1:9){
  plot.zip(main=pos_names[i],data[sel_points[i],],use.first = TRUE)
}

  3. Use the local continuity meta criteria to select the tuning parameter perplexity in t-SNE for ZERO digits (use q=2 and theta=0). Then describe graphically the low dimensional configuration corresponding to the optimal parameter. Indication: As tentative values for perplexity use c(10,20,40).
require(Rtsne)
set.seed(4444)

D1 <- dist(data)
Kp <- 10
q <- 2
theta = 0

perplexity <- c(10,20,40)

LC.tsne <- numeric(length(perplexity))
Rtsne.k <- vector("list",length(perplexity))

for (i in 1:length(perplexity)){
    Rtsne.k[[i]] <- Rtsne(D1, perplexity=perplexity[i], dims=q,
                          theta=0, pca=FALSE, max_iter = 1000)
    D2.k <- dist(Rtsne.k[[i]]$Y)
    LC.tsne[i] <- LCMC(D1,D2.k,Kp)$M.Kp.adj
}

i.max <- which.max(LC.tsne)
perplexity_max <- perplexity[i.max]
best_rtsne <- Rtsne.k[[i.max]]

plot(perplexity, LC.tsne, main=paste0("perplexity max=", perplexity_max))
abline(v=perplexity_max, col=2)

plot(best_rtsne$Y, pch=20, asp=1, main="Best t-SNE output in 2-dim")

5. Compare Local MDS, ISOMAP and t-SNE for ZERO digits

  1. Compare graphically the dimensions of the 2-dimensional configurations you have obtained by Local MDS, ISOMAP and t-SNE for ZERO digits. Indication: Use the function pairs applied to a 6-dimensional matrix.

The following figure shows that the first dimension of the three dimensionality-reduction methods is very similar. The scatterplots between the first dimensions of different methods show a clear correlation, close to the identity (or minus the identity) plus some noise. The same holds between the second dimensions of Local MDS and ISOMAP.

We can say that Local MDS and ISOMAP produce very similar new dimensions, up to sign.

For t-SNE, the resemblance with the other two second dimensions is less clear: a correlation is visible, but the noise is much higher, since the scatterplots are not as close to a straight line.

pairs(cbind(lmds1=lmds.max[,1], lmds2=lmds.max[,2],ismp1=best_k_isomap$points[,1],ismp2=best_k_isomap$points[,2],tsne1=best_rtsne$Y[,1],tsne2=best_rtsne$Y[,2]))
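The visual impression from the pairs plot can be quantified with the correlation matrix of the six coordinates (this sketch assumes lmds.max, best_k_isomap and best_rtsne from the previous sections are still in the workspace); entries near +1 or -1 indicate matching dimensions up to sign:

```r
configs <- cbind(lmds1 = lmds.max[,1], lmds2 = lmds.max[,2],
                 ismp1 = best_k_isomap$points[,1], ismp2 = best_k_isomap$points[,2],
                 tsne1 = best_rtsne$Y[,1], tsne2 = best_rtsne$Y[,2])
round(cor(configs), 2)
```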

  2. Which method has produced the 2-dimensional configuration with the largest value of the local continuity meta criteria?

The method that has produced the largest value for LCMC is, by far, t-SNE.

cat("local MDS: ",max(LC.lmds), "\n")
## local MDS:  0.2628071
cat("ISOMAP: ",max(LC.ismp), "\n")
## ISOMAP:  0.3574188
cat("t-SNE: ",max(LC.tsne),"\n")
## t-SNE:  0.5061906